The data for this analysis comes from #tidytuesday dataset on cocktail recipes. We’ll perform different unsupervised dimensionality reduction methods on different cocktail recipes. As mentioned in the source website, we need to pay attention to the measure column.
boston_cocktails <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')
boston_cocktails %>% count(ingredient, sort = TRUE)
## # A tibble: 569 x 2
## ingredient n
## <chr> <int>
## 1 Gin 176
## 2 Fresh lemon juice 138
## 3 Simple Syrup 115
## 4 Vodka 114
## 5 Light Rum 113
## 6 Dry Vermouth 107
## 7 Fresh Lime Juice 107
## 8 Triple Sec 107
## 9 Powdered Sugar 90
## 10 Grenadine 85
## # … with 559 more rows
Looking at the ingredient column, we may need to do some more data clean up. Some terms are in lower cases, upper cases, etc which make this column inconsistent. Furthermore, we will look at the measure column.
cocktails_parsed <- boston_cocktails %>%
mutate(
ingredient = str_to_lower(ingredient),
ingredient = str_replace_all(ingredient, "-", " "),
ingredient = str_remove(ingredient, " liqueur$"),
ingredient = str_remove(ingredient, " (if desired)$"),
ingredient = case_when(
str_detect(ingredient, "bitters") ~ "bitters",
str_detect(ingredient, "lemon") ~ "lemon juice",
str_detect(ingredient, "lime") ~ "lime juice",
str_detect(ingredient, "grapefruit") ~ "grapefruit juice",
str_detect(ingredient, "orange") ~ "orange juice",
TRUE ~ ingredient
),
measure = case_when(
str_detect(ingredient, "bitters") ~ str_replace(measure, "oz$", "dash"), # make assumption on bitters measurement, there may be some error on collecting this bitter measurement
TRUE ~ measure
),
measure = str_replace(measure, " ?1/2", ".5"),
measure = str_replace(measure, " ?3/4", ".75"),
measure = str_replace(measure, " ?1/4", ".25"),
measure_number = parse_number(measure),
measure_number = if_else(str_detect(measure, "dash$"),
measure_number / 50,
measure_number
)
) %>%
add_count(ingredient) %>%
filter(n > 15) %>%
select(-n) %>%
distinct(row_id, ingredient, .keep_all = TRUE) %>%
na.omit()
cocktails_parsed
## # A tibble: 2,542 x 7
## name category row_id ingredient_numb… ingredient measure measure_number
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 Gauguin Cocktail … 1 1 light rum 2 oz 2
## 2 Gauguin Cocktail … 1 3 lemon jui… 1 oz 1
## 3 Gauguin Cocktail … 1 4 lime juice 1 oz 1
## 4 Fort La… Cocktail … 2 1 light rum 1.5 oz 1.5
## 5 Fort La… Cocktail … 2 2 sweet ver… .5 oz 0.5
## 6 Fort La… Cocktail … 2 3 orange ju… .25 oz 0.25
## 7 Fort La… Cocktail … 2 4 lime juice .25 oz 0.25
## 8 Cuban C… Cocktail … 4 1 lime juice .5 oz 0.5
## 9 Cuban C… Cocktail … 4 2 powdered … .5 oz 0.5
## 10 Cuban C… Cocktail … 4 3 light rum 2 oz 2
## # … with 2,532 more rows
cocktails_df <- cocktails_parsed %>%
select(-ingredient_number, -row_id, -measure) %>%
pivot_wider(names_from = ingredient, values_from = measure_number, values_fill = list(measure_number = 0)) %>%
janitor::clean_names() %>%
na.omit()
cocktails_df
## # A tibble: 937 x 42
## name category light_rum lemon_juice lime_juice sweet_vermouth orange_juice
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Gaug… Cocktai… 2 1 1 0 0
## 2 Fort… Cocktai… 1.5 0 0.25 0.5 0.25
## 3 Cuba… Cocktai… 2 0 0.5 0 0
## 4 Cool… Cocktai… 0 0 0 0 1
## 5 John… Whiskies 0 1 0 0 0
## 6 Cher… Cocktai… 1.25 0 0 0 0
## 7 Casa… Cocktai… 2 0 1.5 0 0
## 8 Cari… Cocktai… 0.5 0 0 0 0
## 9 Ambe… Cordial… 0 0.25 0 0 0
## 10 The … Whiskies 0 0.5 0 0 0
## # … with 927 more rows, and 35 more variables: powdered_sugar <dbl>,
## # dark_rum <dbl>, cranberry_juice <dbl>, pineapple_juice <dbl>,
## # bourbon_whiskey <dbl>, simple_syrup <dbl>, cherry_flavored_brandy <dbl>,
## # light_cream <dbl>, triple_sec <dbl>, maraschino <dbl>, amaretto <dbl>,
## # grenadine <dbl>, apple_brandy <dbl>, brandy <dbl>, gin <dbl>,
## # anisette <dbl>, dry_vermouth <dbl>, apricot_flavored_brandy <dbl>,
## # bitters <dbl>, straight_rye_whiskey <dbl>, benedictine <dbl>,
## # egg_white <dbl>, half_and_half <dbl>, vodka <dbl>, grapefruit_juice <dbl>,
## # blended_scotch_whiskey <dbl>, port <dbl>, white_creme_de_cacao <dbl>,
## # citrus_flavored_vodka <dbl>, whole_egg <dbl>, egg_yolk <dbl>,
## # blended_whiskey <dbl>, dubonnet <dbl>, blanco_tequila <dbl>,
## # old_mr_boston_dry_gin <dbl>
We can use the commonly used prcomp for PCA analysis, but we’ll follow Julia Silge’s approach which uses tidymodels. This is also easily extendable/usable for other models.
library(tidymodels)
pca_rec <- recipe(~ ., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## id 2
## predictor 40
##
## Operations:
##
## Centering and scaling for all_predictors
## No PCA components were extracted.
pca_prep <- prep(pca_rec)
pca_prep
## Data Recipe
##
## Inputs:
##
## role #variables
## id 2
## predictor 40
##
## Training data contained 937 data points and no missing data.
##
## Operations:
##
## Centering and scaling for light_rum, lemon_juice, ... [trained]
## PCA extraction with light_rum, lemon_juice, ... [trained]
tidied_pca <- tidy(pca_prep, 2)
tidied_pca %>%
filter(component %in% paste0("PC", 1:5)) %>%
mutate(component = fct_inorder(component)) %>%
ggplot(aes(value, terms, fill = terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, nrow = 1) +
labs(y =NULL)
- PC1: powdered sugar vs. simple syrup; recipes are not likely to have both. Let’s zoom in on the first four components, and understand which cocktail ingredients contribute in the positive and negative directions.
# need to reorder again, so it'll be sorted based on values
tidied_pca %>%
filter(component %in% paste0("PC", 1:4)) %>%
group_by(component) %>%
top_n(8, abs(value)) %>%
ungroup() %>%
ggplot(aes(reorder(terms, abs(value)), abs(value), fill = value > 0)) +
geom_bar(stat="identity") +
coord_flip() +
facet_wrap(~component, scales = "free_y") +
labs(y = NULL, x = NULL, fill = "Positive?")
- PC1: powdered sugar + egg + gin drinks vs. simple syrup + lime + tequila drinks. This is the component that explains the most variation in drinks. - PC2: mostly about vermouth, both sweet and dry.
We can then observe more specifically PC1 and PC2.
juice(pca_prep) %>%
ggplot(aes(PC1, PC2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 2) +
geom_text(check_overlap = TRUE, hjust = "inward") +
labs(color = NULL)
- Fizzy, egg, powdered sugar drinks are to the left. - Simple syrup, lime, tequila drinks are to the right. - Vermouth drinks are more to the top.
Let’s look at PC2 and PC3.
juice(pca_prep) %>%
ggplot(aes(PC2, PC3, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 2) +
geom_text(check_overlap = TRUE, hjust = "inward") +
labs(color = NULL)
It’s less obvious as compared to PC1 and PC2. How about PC1 and PC3?
juice(pca_prep) %>%
ggplot(aes(PC1, PC3, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 2) +
geom_text(check_overlap = TRUE, hjust = "inward") +
labs(color = NULL)
Most of cocktail classics are found on the left. We can explore further on different combinations of principal components.
library(embed)
umap_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_umap(all_predictors())
umap_prep <- prep(umap_rec)
umap_prep
## Data Recipe
##
## Inputs:
##
## role #variables
## id 2
## predictor 40
##
## Training data contained 937 data points and no missing data.
##
## Operations:
##
## Centering and scaling for light_rum, lemon_juice, ... [trained]
## UMAP embedding for light_rum, lemon_juice, ... [trained]
We can also do similar analysis just like what we had for PCA i.e. analysing the different principal components (UMAP components here).
juice(umap_prep) %>%
ggplot(aes(umap_1, umap_2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 2) +
geom_text(check_overlap = TRUE, hjust = "inward") +
labs(color = NULL)
t-SNE is also another increasingly popular method to do dimensionality reduction. Inspired by Joshua Cook’s analysis, we’ll use Rtsne library to run t-SNE analysis.
library(Rtsne)
tsne_cocktails <- Rtsne(cocktails_df, perplexity = 5, check_duplicates = FALSE)
tsne_cocktails_tib <- as_tibble(tsne_cocktails$Y) %>%
set_names(c("z1", "z2")) %>%
mutate(category = cocktails_df$category)
Let’s visualize our t-SNE results
tsne_cocktails_summary <- tsne_cocktails_tib %>%
group_by(category) %>%
summarise(avg_z1 = mean(z1), avg_z2 = mean(z2)) %>%
ungroup()
tsne_cocktails_tib %>%
ggplot(aes(z1, z2)) +
geom_point(aes(color = category)) +
geom_label(aes(label = category, x = avg_z1, y = avg_z2),
data = tsne_cocktails_summary,
label.size = 0,
label.padding = unit(1, "mm"),
fill = "white",
fontface = "bold",
alpha = 0.8) +
#scale_color_brewer(palette = "Set2") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "none"
) +
labs(x = "Dim 1",
y = "Dim 2",
title = "PCA and t-SNE of drinks by ingredients")
We can observe some clear separate clusters, but some categories like Vodka should not be close to Non-alcoholic drinks. We can tune around the different parameters that t-SNE algorithm has such as its perplexity, but what we currently have is good enough for basic data exploratory analysis.